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Task  1:  Simulating  Genetic  Engineering 

•  Goal:  To  create  datasets  for  testing  our  detection  methods  and  to  share  them  with  the  broader 
scientific  community 

•  No  artificially  engineered  genome  is  available  in  GENBANK 

•  Approach:  Design  computer  programs  to  simulate  tampering  of  genomes 

•  We  used  E.  coli  (K12  strain)  genome  as  the  backbone  in  our  simulations 

•  Create  data  by  tampering  with  the  host  genome 

-  Insertion  of  foreign  gene  into  the  host  genome. 

-  Replacement  of  host  genes  with  orthologous  genes  from  foreign  species. 

•  Varied  the  difficulty  level  for  detection  methods 

-  Tampered  with  genes  from  species  at  various  distances  from  E.  coli. 

Task  2:  Comparative  Genomics  Approach 

•  Goal:  Compare  a  target  genome  with  related  isolates  and  determine  whether  it  has  been 
engineered. 

•  Why  Comparative  Genomics? 

-  We  can  tell  more  about  a  genome  by  comparing  it  with  related  genomes  rather  than  by 
looking  solely  at  the  genome  alone 

-  With  advent  of  large  scale  sequencing,  any  target  genome  is  likely  to  have  a  sequenced 
relative. 

•  More  than  480  genomes  in  GENBANK. 

We  have  developed  computational  pipeline  that  compares  a  target  genome  with  related  genomes  and 
find  regions  that  have  been  potentially  engineered.  Our  pipeline  compares  the  target  genome  with 
related  genomes  and  finds  “unique”  genes  that  have  no  homologs.  These  “unique  genes”  can  then  be 
tested  for  other  criteria  like  DNA  composition  to  narrow  down  the  list  of  potential  engineered  genes. 


Target 

Genome 


Comparison  with 
Related  Genomes 


Task  3:  DNA  Composition  Tests 
Can  be  generalized  to  K-mers: 

It  has  been  known  since  the  1960s  that  different  organisms  have  different  “genome  signatures”.  So 
engineered  DNA  will  have  a  different  “signature”  compared  to  the  signature  of  host  DNA.  The  most 
common  DNA-composition  metric  is  the  Kmer  metrics,  where  we  measure  the  frequency  of  length  K 
words  in  a  sequence.  The  figure  shows  the  dimer  metric,  but  we  can  generalize  it  for  arbitrary  K. 
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•  Use  Hexamer  Frequencies. 

•  Curse  of  Dimensionality 

-  Exponential  growth  of  feature  space  with  higher  K. 

•  Use  PC  A  to  deal  with  the  curse  of  dimensionality 

-  First  few  PCs  good  for  detecting  outliers. 

We  use  hexamer  frequencies  because  it  can  detect  biases  due  to  (i)  codon  bias  (ii)  restriction  enzymes. 
However,  the  dimensionality  of  the  data-set  increases  with  the  size  of  K.  We  deal  with  this  problem  by 
using  Principal  Component  Analysis.  The  first  few  principal  components  are  good  enough  to  detect 
outliers. 


Task  4:  Comparative  Genomics  Pipeline 


Principal  Component  1 


In  this  figure,  we  show  the  first  principal  component  of  the  hexamer  frequency  data  of  IK  fragments 
taken  from  the  E.  coli  genome  engineered  with  B.  anthracis  genes.  The  host  DNA  is  shown  in  red, 
whereas  the  engineered  DNA  is  shown  in  red.  However  this  test  also  detects  other  anomalous  DNA  such 
as  the  ones  due  to  recent  Lateral  Gene  Transfer. 

Operon  Test 

•  Assumption:  Adversary  inserts  a  group  of  contiguous  genes  or  operons  [e.g.  pathogenic  islands]. 

•  Test:  Only  look  for  “unique  genes”  that  occur  in  clusters. 

•  Caveat:  Many  toxins  can  be  single  genes. 

We  have  also  developed  an  operon  test  in  which  we  look  for  clusters  of  unique  genes.  This  is  because 
many  “related  genes”  like  pathogenic  islands  occurs  in  clusters  in  prokaryotes. 

Testing 

•  Use  Simulated  Data  Sets 


-  Insert  20  foreign  genes  from  Bacillus  anthracis  to  Escherichia  coli  K12  genome. 

•  Goal:  To  filter  the  list  of  “candidate  engineered  genes”. 

To  test  our  methods,  we  simulated  engineering  of  20  Bacillus  anthracis  genes  into  E.  coli  K12  genome. 
Then  we  ran  our  pipeline  through  our  data-set. 

Task  4:  Comparative  Genomics  Pipeline 
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The  results  of  each  step  of  the  pipeline  are  shown  in  blue.  The  initial  whole  genome  has  4263  genes  of 
which  20  are  “engineered”.  After  comparing  with  closely  related  E.  coli  genomes,  our  list  was  narrowed 
to  197  genes,  including  all  20  engineered  genes.  Additional  tests  like  the  DNA  composition  and  operon 
test  further  decreased  the  number  of  candidate  genes. 


Task  5:  Detect  Pathogenic  Islands 


We  seek  to  answer  the  following  question: 


Can  we  learn  functional  motifs 
associated  with  pathogenicity? 
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from  Ecker  et  ah,  “The  Microbial  Rosetta  Stone  Database:  A  compilation  of  global  and  emerging 
infectious  microorganisms  and  bioterrorist  threat  agents,”  BMC  Microbiology,  2005 
This  is  what  we  know  about  the  “family  tree”  of  pathogenic  bacteria  to  date.  We  wished  to  learn 
signatures  of  pathogenicity  both  within  and  across  phyla. 

Construction  of  training  set:  positive  examples 


Rule 

not(repressor  or  transcription [al]  regulator) 

Class  1 

pathogeni  or  pathogene  or  virulen 

Class  2 

secretory  or  “secretion  system''  or  (secretion  and  extracellular) 

Class  3 

iron  and  permease 

Class  4 

(membrane  and  antigen)  or  O-antigen  or  (surface  and  antigen) 

Class  5 

adherence  or  adhesi 

Class  6 

invasi  or  invasol 

Class  7 

toxin 

and  (not  anti[-]toxin  or 

(“  toxin  of  toxin [-/]anti[-]toxin  [system]" 

and  not(“anti[-]toxin  of  toxin[-/]anti[-]toxin  [system]"))) 

and  not  “toxin-binding” 

Class  8 

h[a]emolysin  or  h[a]emolytic  or  “lytic  enzyme” 

Our  SVM  method  needs  training  data  to  leam  signatures  of  pathogenicity.  We  accomplish  this  by 
mining  literature  annotation  for  key  words  associated  with  nasty  genes.  Some  of  our  classes  are  less 
specific  (class  1 -pathogen)  and  some  are  more  specific  (class  8-haemolysin). 

Construction  of  training  set:  negative  examples 

Since  the  function  of  many  genes  is  unknown,  the  best  way  to  be  sure  of  having  non-pathogenic  genes  is 
to  take  a  sampling  of  random  genes  from  nonpathogenic  organisms  for  negative  examples  for  our 
training  set. 


Table  5.  Nonpathogenic  organisms  used  to  create  negative  training  sets  and  their 
phyla. 


Aquifex  aeolicus  VF5 

Aquificae 

Bacillus  halodurans  C- 1 25 

Firmicutes 

Bacillus  subtilis  subsp.  subtilis  str.  168 

Firmicutes 

Caulobacter  crescentus  C'B  1 5 

Proteobacteria 

Chlorobium  tepidum  TLS 

Chlorobi 

Clostridium  acetobutylicum  ATCC  824 

Firmicutes 

Clostridium  thermocellum  ATCC  27405 

Firmicutes 

Corynebacterium  glutamicum  ATCC  13032 

Actinobacteria 

Dehalococcoides  ethenogenes  195 

Chloroflexi 

Deinococcus  radiodurans  R 1 

Deinococcus-Thermus 

Desulfovibrio  vulgaris  subsp.  vulgaris  DP4 

Proteobacteria 

Geobacter  sulfurreducens  PCA 

Proteobacteria 

Photorhabdus  luminescens  subsp.  laumondii  TTOl 

Proteobacteria 

Pseudomonas  putida  KT2440 

Proteobacteria 

Rhodobacter  sphaeroides  2.4. 1 

Proteobacteria 

Shewanella  oneidensis  MR- 1 

Proteobacteria 

Streptomyces  coelicolor  A3(2) 

Actinobacteria 

Synechocystis  sp.  PCC  6803 

Cyanobacteria 

Thermotoga  maritima  MSB8 

Thermotogae 

Thermus  thermophilus  HB8 

Deinococcus-Thermus 

Method 

We  use  support  vector  machines  to  learn  8  pathogenic  classes  of  genes,  using  motifs  consisting  of  very 
short  amino  acid  strings  (simple  string  kernel  method). 

SVMs  leam  classes  in  a  high  dimensional  feature  space.  The  features  that  we  use  are  simply  frequencies 
of  very  short  substrings  of  amino  acids  (3-4  aa’s  long;  including  a  wildcard  character).  It’s  of 
independent  interest  that  such  short  motifs  have  a  functional  signal. 

Proteobacteria 

Here  are  the  results  when  restricted  to  Proteobacteria  in  a  10-fold  cross  validation  study.  The  “Area 
Under  the  Curve”  statistic  measures  the  tradeoff  of  the  true  positive/true  negative  rate  as  a  single  value- 
AUC  of  1  is  perfect  classification  and  50  percent  is  random  chance.  The  p-value  statistics  represent  the 
likelihood  of  seeing  that  AUC  for  that  class  by  chance. 


Class 

AUC  (p- value) 

Pathogenicity/virulence 

0.806  «  1  x  10— 4U) 

Secretion 

0.824  «  1  x  10 4U) 

Iron  permeases 

0.880  «  1  x  10“4U) 

Surface  antigens 

0.808  «  1  x  10“4U) 

Adhesins 

0.648  (4.660  x  10~y) 

Invasins 

0.692  (1.577  x  10“14) 

Toxins 

0.694  «  1  x  10— 40) 

Hemolysis  genes 

0.702  «  1  x  10— 40) 

AUC  for  all  8  classes  in  the  10-fold  cross-validation  study. 


Results:  Leave-phylum-out  cross-validation 

Here  are  the  same  statistics  for  a  “leave  phylum  out”  cross  validation.  This  is  a  much  harder  problem, 
because  we  are  trying  to  leam  what  toxins  in  Actinobacteria  are  from,  for  example,  toxins  in 
Proteobacteria.  We  do  surprisingly  well,  with  a  couple  of  exceptions-  the  classes  we  do  badly  with  are 
marked  with  “1”  -  in  most  cases  the  reason  is  too  few  examples;  the  only  exception  is  we  are  not  doing 
well  at  predicting  invasins  in  Proteobacteria  when  trained  only  on  bacteria  outside  the  Proteobacteria 
phylum.  On  closer  examination  of  the  data,  Proteobacteria  contain  a  set  of  proteins  all  homologous  to 
one  protein  labeled  “invasion-like”  which  is  completely  dissimilar  to  examples  found  outside  the 
phylum.  So  that’s  why  we  do  so  poorly  on  this  class. 


pathogenicity 

secretion 

iron  permeases 

antigens 

Ac  ti  nobacteria 

0.63(9.65  x  It)-4) 

0.63(0.18) 

0.97  (8.19  x  10- 13) 

0.73(0.01) 

Bacteroidetes 

N/A 

0.60  (>) 

LOO  (0.03) 

0.80  (10~3) 

Chlaymidiae 

0.88  (10-3) 

N/A 

N/A 

N/A 

Firmicutes 

0.68  (0.07) 

0.74(4.11  x  10-5) 

0.99  (*) 

0.98  (0.08) 

Proteobacteria 

0.69  (*) 

0.54(4  x  10-3) 

0.97  (*) 

0.70  (*) 

Spirochetes 

0.82  (0.02) 

0.61  (0.49) 

N/A 

0.92  (0.04) 

adhesins 

invasins 

toxins 

hemolysis 

Actinobacteria 

0.72(2  x  10-3) 

0.74(0.02) 

0.47  t1) 

0.66  (0.02) 

Bacteroidetes 

N/A 

N/A 

0.97  (0.08) 

0.77  (3  x  10-3) 

Chlaymidiae 

0.80  (0.01) 

N/A 

N/A 

0.81  (0.31) 

Firmicutes 

0.87  (*) 

0.79  (0.09) 

0.85  (*) 

0.78(6.83  x  lO"12) 

Proteobacteria 

0.79  (*) 

0.43  (>) 

0.72  (*) 

0.67  (2.22  x  10-16) 

Spirochetes 

0.99  (0.04) 

0.95  (0. 1 1 ) 

N/A 

0.65  (3  x  10-3) 

AUC  for  all  48  components  of  the  leave-phylum-out  cross-validation  study. 

Here  (')  means  the  observed  AUC  had  no  significance,  (*)  indicates  a  p-value  of  less  than  1  x  10-40 
and  “N/A”  indicates  that  there  were  no  test  examples  for  the  associated  component. 


Results:  MvirDB 

MvirDB  is  the  database  of  virulence  genes  compiled  at  Lawrence  Livermore  National  Labs.  We  used  it 
as  an  independent  test  set  for  our  SVM  method;  seeing  which  MvirDB  genes  that  were  not  part  of  our 
initial  training  set  showed  up  as  positive  examples  according  to  the  SVM.  Results  are  good. 


Class 

AUC 

Pathogenicity /virulence 

0.941  (2.822  x  10“^) 

Secretion 

0.874(2.133  x  lO"1) 

Iron  Permeases 

0.767  (2.523  x  10“ L) 

Surface  antigens 

0.942  (1.829  x  lO"8) 

Adhesins 

0.972  (4.091  x  10-5) 

Invasins 

N/A 

Toxins 

0.890  «  1  x  10-4U) 

Hemolysis  genes 

0.918  (2.061  x  10-12) 

AUC  for  all  8  components  of  the  MvirDB  experiment. 


What  we  get  is  different  from  BLAST 

It  would  not  be  interesting  if  our  SVM  was  identifying  the  same  set  of  genes  that  BLAST  would. 
Fortunately,  we  get  a  very  different  set  of  genes.  Thus  the  SVM  method  adds  value  to  any  method  based 
on  BLAST  and  identifies  a  significant  number  of  positive  examples  BLAST  would  miss. 


Class  (size) 

BLAST 

SVMs 

overlap 

Pathogenicity/virulence  (338) 

60/338 

93/338 

1 6/338 

Secretion  (436) 

1  1 1/436 

70/436 

24/436 

Iron  permeases  (173) 

170/173 

163/173 

160/173 

Surface  antigens  (221) 

41/221 

78/221 

1 6/22 1 

Adhesins  (188) 

50/188 

89/188 

31/188 

Invasins  (136) 

24/136 

20/136 

6/136 

Toxins  (219) 

47/219 

80/219 

19/219 

Hemolysis  genes  (333) 

153/333 

1 1 1/333 

72/333 

Comparison  between  BLAST  and  SVMs. 


Results:  genome  profiling 

•  By  counting  the  number  of  genes  predicted  to  be  associated  with  pathogenicity  in  the  top  .15% 
of  genomes  from  unknown  organisms,  we  can  score  their  relative  pathogenicity. 

Furthermore,  our  SVM  method,  when  run  against  a  whole  genome,  can  generate  a  global  score  for  how 
pathogenic  a  particular  bacterial  species  or  strain  is.  We  look  at  the  top  .15%  scoring  genes  in  the 
genome,  and  generate  a  pathogenicity  score  to  compare  globally  across  genomes. 


Conclusion 


Score 

Organism 

0.2597 

Escherichia  coli  0157:H7  str.  Sakai 

0.1762 

Mycobacterium  tuberculosis  H37Rv 

0. 1 306 

Escherichia  coli  W31 10 

0.1215 

Burkholderia  pseudomallei  668 

0. 1 1 39 

Burkholderia  thailandensis  E264 

0.1002 

Pseudomonas  aeruginosa  UCBPP-PA14 

0.0835 

Pseudomonas  putida  KT2440 

0.0228 

Mycobacterium  smegmatis  str.  MC2  155 

Conclusion 

•  There  are  short  signatures  of  pathogenicity  that  have  functional  implications 

•  These  methods  are  orthogonal  to  BLAST ;  produce  different  results 

•  Can  help  with  high-throughput  annotation  of  unknown  bacterial  genomes 

•  Can  help  with  Environmental  Sampling 

•  A  list  of  currently  unannotated  genes  that  our  method  predicts  may  have  pathogenic  function  is 
available  on  request. 

Our  next  task  is  to  use  this  to  find  pathogenic  islands. 

Detect  Inserted  Foreign  DNA 

•  Goal:  Identify  when  a  string  of  foreign  DNA  has  been  artificially  inserted  into  a  host 


•  Approach:  Use  methods  of  unsupervised  anomaly  detection 

•  Intuition:  Foreign  DNA  should  have  anomalous  codon  bias,  compared  with  host  organism 

•  Methods 

-  Distance  from  Centroid 

-  One-class  Support  Vector  Machines 

-  Compression-based  Methods 

•  Results:  Unsatisfactory.  Host  organisms  themselves  contain  large  amounts  of  ‘anomalous’  DNA 
due  to  horizontal  gene  transfer 

Detecting  Phylogenetic  Outliers 

•  Target:  a  familiar  genome  with  foreign  genes  inserted 

•  Rationale:  e.g.,  insertion  of  pathogenic  genes  from  anthrax  in  the  genome  of  a  common,  easily 
spread  bacterium.  Also  occurs  naturally  (LGT),  but  rarely,  so  would  focus  attention  on  a  small 
subset  of  genes  including  malicious  insertions. 

•  Traditional  approach  (e.g.,  Lerat  et  ah,  PLoS Biol.,  2003)  relies  on  phylogenetic  tree 
construction,  which  can  be  done  in  many  different  ways,  each  with  its  own  limitations. 

Our  goal  is  to  identify  genes  whose  evolutionary  history  appears  different  from  the  rest  of  the  genes  in  a 
genome.  This  will  serve  to  focus  our  attention  on  genes  that  might  have  been  maliciously  inserted  from 
another  organism,  as  well  as  on  genes  whose  history  is  different  due  to  natural  causes  (lateral  gene 
transfer,  or  LGT).  Thus,  our  algorithm  may  be  of  independent  interest  as  a  complementary  way  to 
detect  LGT.  In  conjunction  with  our  predictor  for  pathogenicity,  this  method  may  identify  malicious 
engineered  sequences. 

•  Idea:  if  a  gene  is  inserted  from  a  foreign  organism,  its  position  in  the  tree  will  appear  to  have 
moved  significantly. 

•  Our  approach:  use  distance  rather  than  trees  to  find  these  outliers 

•  In  this  example,  pairwise  distances  from  gene  3  in  species  E  to  gene  3  in  species  A-D  will  all  be 
unusual. 

•  Tested  on  E.  coli  genome  with  simulated  horizontal  gene  transfer  (swapping  genes  in  from  other 
proteobacteria). 

Often,  LGT  is  currently  detected  using  tree-based  methods.  The  problem  is  that  constructing 
phylogenetic  trees  is  slow  (not  suitable  for  scanning  whole  genomes,  generally)  and  sometimes 
incorrect.  We  can  solve  our  problem  using  the  distance  data  used  for  tree  construction,  but  without 
actually  building  the  trees.  This  makes  our  approach  faster  and  avoids  some  of  the  errors  that  tree 
reconstruction  methods  can  make. 


Results 


•  Overall  sensitivity:  46%.  Specificity:  hard  to  assess  because  right  answer  unknown,  but  we 
predict  a  comparable  rate  of  LGT  to  previous  methods. 

•  Our  accuracy  is  very  high  in  two  crucial  cases: 

•  Species  not  too  close  to  E.  coli 

•  Rapidly-evolving  species 

•  where  tree  methods  fail 

•  95%  sensitivity  finding  insertions  from  species  with  up  to  60%  sequence  identity  of  the  original 
genome. 

To  test  our  methods,  we  created  a  data  set  where  we  swapped  genes  into  E.  coli  from  related  organisms 
and  attempt  to  detect  them  using  our  distance-based  approach.  Our  method  is  strong  at  detecting 
swapped  genes  in  two  crucial  cases:  where  the  swapped  sequences  are  reasonably  distant  from  the 
original  ones,  and  when  they  are  swapped  in  from  species  that  are  rapidly  evolving  -  in  which  case  tree 
construction  methods  don’t  do  very  well.  Overall,  we  are  able  to  detect  about  46%  of  the  swapped 
genes,  while  predicting  8-10%  of  the  genome  as  having  non-standard  evolution.  (We  can’t  assess  the 
specificity  this  way,  though,  because  most  of  these  are  probably  not  false-positives;  published  estimates 
of  the  rate  of  LGT  in  bacterial  genomes  range  up  to  15%,  though  we  think  that’s  a  little  high,  so  we’re 
very  happy  with  8-10%  predicted  positives.) 

•  We  succeed  especially  well  when  traditional  tree-based  methods  fail. 

•  Our  method  is  fast  enough  to  be  used  on  entire  genomes  (unlike  tree-based  methods). 

•  Only  other  genome-wide  LGT-detection  method  (DarkHorse)  uses  BLAST.  Comparison  on  E. 
coli  genome  (without  inserted  outliers): 

-  Very  different  results,  but  some  evidence  we  are  right. 

-  We  predict  27  outliers  in  E.  coli  that  they  missed,  including  selB,  thyA,  hscC:  literature 
calls  these  examples  of  HGT. 

We  compare  our  method  to  tree-based  methods  and  find  that  we  find  complementary  things,  plus  tree 
methods  are  too  slow  to  use  on  whole  genomes.  The  only  existing  whole-genome  method  that  we’ve 
found  (for  LGT)  is  DarkHorse,  based  on  BLAST.  We  again  find  that  our  results  are  complementary.  In 
particular,  we  predict  27  E.  coli  genes  have  non-standard  evolution  that  they  fail  to  find.  We  haven’t 
manually  verified  all  of  these  yet,  but  so  far  we  know  that  at  least  three  of  them  have  independent 
published  evidence  for  being  true  examples  of  LGT. 

•  Our  distance-based  method  for  detecting  inserted  genes  (and  LGT)  works  well  in  cases  where 
tree-  and  BLAST-  based  methods  fail. 

•  Our  approach  should  be  used  in  combination  with  these  other  methods  to  identify  potential 
malicious  insertions. 

Our  approach  finds  real  LGT  that  other  methods  miss,  and  are  fast  enough  to  use  on  whole  genomes. 
They  should  be  used  in  conjunction  with  predictors  for  pathogenicity  and  other  LGT  approaches  to 
identify  candidate  malicious  insertions. 


